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Abstract 

Many natural signals exhibit a sparse representation, whenever a suitable describing model is given. 
Here, a linear generative model is considered, where many sparsity-based signal processing techniques 
rely on such a simplified model. As this model is often unknown for many classes of the signals, we 
need to select such a model based on the domain knowledge or using some exemplar signals. This paper 
presents a new exemplar based approach for the linear model (called the dictionary) selection, for such 
sparse inverse problems. The problem of dictionary selection, which has also been called the dictionary 
learning in this setting, is first reformulated as a joint sparsity model. The joint sparsity model here differs 
from the standard joint sparsity model as it considers an overcompleteness in the representation of each 
signal, within the range of selected subspaces. The new dictionary selection paradigm is examined with 
some synthetic and reahstic simulations. 

I. Introduction 

The sparse signal model is one the most successful low-dimensional signal models for modem signal 
processing applications |lj]. In this model, any considered signal y G W", can be represented as the sum 
of a few elementary functions, called the atoms, plus some noise n G M™, as follows, 

y = Dx + n, 
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where D G MJ^^p, called the dictionary, is the collection of the atoms and x G is a sparse vector. In 
this setting, y is often called a sparse signal in D. The additive noise is used to consider the inaccuracy of 
the measurement device or the model mismatch. While choosing an overcomplete dictionary, i.e. p > m, 
gives us a flexibility to choose sparser representation, the extra redundancy can be damaging in ducking 
failures coefficient recovery. Therefore, the success of sparse signal models depends on how well we 
choose a redundant D, which is the main focus of this paper. 

There is a lot of interest in building redundant dictionaries to make more flexible models and various 
techniques have already been proposed to design the dictionary using some domain knowledge, see 
for example ||21, or learning the dictionary using a given set of exemplars [3], see H and ||5l for a 
more complete review on different dictionary selection techniques. The advantage of the first approach 
is the possibility of incorporating already known signal structures and often fast implementation of the 
dictionary. The second approach does not need such prior information about the signals, but they often find 
an unstructured dictionary with a computationally expensive implementation. We will combine these two 
methods in this paper, by considering a large set of potentially good atoms $ € R™^", n > p, called a 
mother dictionary here, and selecting a smaller set of atoms as the final dictionary D. Fast implementation 
of such dictionaries are guaranteed, if the mother dictionary has such a property. For instance, scalar 
products of a given signal x with a family of Gabor atoms of length m can be implemented with a 
computational complexity of 0{m\ogm). Also, as we restrict the search space to the dictionaries with 
mother atoms, it can be learned using much less exemplars. In other words, restricting the dictionary to 
a subset of mother atoms, regularises the dictionary learning problem and avoids overfitting. 

As all the atoms of D exist in any sparse signal in D, can be represented using The reader may 
ask, if we can use the large dictionary why we need to shrink it to find a dictionary which at best 
can only sparsify the signal to the same level. The answer to this question can be given by noting that, 
finding the sparse approximations have non-polynomial complexity, in a general setting. The success of 
practical sparse approximation algorithms depends on some internal structures of the dictionary, including 
mutual coherence [6], Restricted Isometry Property (RIP) [7] or the null-space property HI. Dictionary 
size indirectly affects these properties such that larger dictionaries mostly make the sparse recovery more 
difficult. Roughly speaking, it is caused by the fact that putting more atoms in the dictionary, the atoms 
become more similar. Such similarities between different atoms, indeed make it more challenging to find 
which set of atoms represents the signals more accurately, i.e the problem of exact (support) recovery. 
The approximation in such large dictionaries would also be noise sensitive, as small noise may cause the 
wrong atoms to become selected. Finally, in coding, the cost of indexing which atoms being used in the 
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representation (x), a.k.a. the binary significance map, grows as the dictionary size increases. 

A. Related Work 

The problem of dictionary design by combining the atoms of a mother dictionary was considered in 
|[9ll . lITOl . In this setting, an auxiliary sparse matrix combines the mother atoms, to generate a dictionary 
which fits the given learning samples. The size of dictionary is fixed here and as the learned dictionary is 
the multiplication of a sparse matrix and a structured matrix (with a possibly fast multiplication), we can 
implement such a dictionary in two steps, where each of them are cheaper than 0{p^). The dictionary 
selection problem can be interpreted as a particular case of sparse dictionary learning, when the sparse 
matrix can have only p < n non-zero elements, with one non-zero on each row. 

The problem of learning a dictionary, when the size of dictionary is not given, has been investigated 
in fUl- The dictionary selection problem has also a similar approach, by finding smaller size dictio- 
naries from the given larger reference dictionaries. The difference is that the reference dictionary is 
fixed throughout the learning here, which allows us to handle significantly larger problems and find 
computationally fast dictionaries. 

The dictionary selection, which will be considered in this paper, is also related to the problem of 
subset selection in machine learning lITll . |[T3l . where the goal is to select the most relevant subset, which 
describes the whole set. ifTll uses the fact that such a model selection can be formulated as a submodular 
cost minimisation. For such a formulation, there exist some canonical solvers, which guarantee to find 
a neighbourhood solution. The derived neighbourhood is indeed not small, which motivated Das and 
Kempe |[T3l to present an alternative submodular formulation to reduce the approximation error. 

B. Contributions 

We here choose a different path to the mentioned dictionary selection techniques in previous section, 
by reformulating the problem as a generalised form of joint sparse representation problem |[T4ll . |[T5l. To 
the authors' knowledge, it is the first time that the dictionary selection problem is modelled in this way. In 
this model, representation of each signal is not only p-joint sparse, it is also A;-sparse in the selected joint 
sparsity support. We here assume p > m, which makes the representation of each signal in the selected 
p-joint support, non unique, where A;-sparsity constraint can help to find the correct representation. 

Based on the new signal model, we need to solve a quadratic objective. As the signal model and the 
objective include unbounded solutions, we need to investigate the conditions that the problem is well- 
defined. Such an analysis is useful for the convergence study of any algorithm introduces to solve the 
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problem. The boundedness and uniqueness of the solutions of the introduced optimisation problem are 
also introduced in this paper. 

As the dictionary can be found using the active rows of the coefficient matrix of the introduced 
optimisation program, we need to practically solve a non-polynomial time complexity problem. We here 
introduce a technique, which is inspired from the iterative hard thresholding for sparse approximations 
mi, ifTTl . to find such an active set of atoms. The algorithm is equipped with a line-search technique to 
guarantee the monotonic decrease of the (positive) objective. The convergence of the algorithm is also 
investigated in this paper. 

In the numerical tests provided in this paper, the new approach is shown to recover the exact dictionary, 
in a large range of sparsity/overcompleteness parameters. 



C. Paper Organisation 

We initially formulate the dictionary selection problem as an overcomplete joint sparse representation 
problem in Section [ll] We then introduce an iterative algorithm to solve the problem approximately in 



Section [III] and show some dictionary recovery results with synthetic data simulation in Section |IV] We 
also show some simulation results on the Curvelet sub-dictionary selection for the finger print data in 
this section. The paper will be concluded in Section [V] 

II. Mathematical Modeling 

Let Y = [yi];g[x l] ^ matrix made by training samples G M™ and $ = [0j]i6x, |X| = n be 
a mother dictionary of normalised atoms (f)i G M™. We assume that the generative dictionary D G 
]R™^P, m < p is made by a subset selection of atoms in i.e. D = [^iljgj- where J d T and 
\J\ = p < n. We assume that each y; is approximately generated by a A;-sparse coefficient vector 7^, 

yi ~ D7i. 

We want to find a dictionary that fulfils the two (apparently contradictory) objectives : few elements in the 
dictionary, and sparsest decomposition for each signal. In other words D which is both small and efficient] 
The problem of optimal dictionary selection can thus be defined as finding the index set J meeting those 
criteria, given Y, p and k. Let X G W^^^ be a coefficient matrix and : 1— )• [l,n] be the 

mapping that assigns the corresponding atom index of $ to the i*^ component of 7^. By assigning 
{x;}^^(,) ^ G V/ G [1,L], while the other elements of X are set to zero, the generative 

model can be reformulated as. 
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Y ^ *X. (1) 

As X is /c-sparse in each column and p-row-sparse, i.e. only p rows of X have non-zero components, it 
lies in the intersection of the following sets, 

/C := {e G M'"^^ : ||0H|o < A:, V/ G [1, L]} (2) 



P:={0GM'"^^: ||0||o,oo<p} (3) 

where ||0||o,oo = \W\\o^ with := ||0'''||oo and 9^''' is the i*'* row of 0. In other words, sets /C and 
V are the sets of m by L matrices which respectively have k non-zero elements on each column and 
p non zero rows. The signals which can be represented using a coefficient matrix in /C n "P, the {k,p)- 
(overcomplete) joint sparse signals or, we will simply say that they follow the (A;,p)-(overcomplete) joint 
sparsity model. 

The optimal dictionary D, which can alternatively be indicated by J', is defined as the solution of the 
following problem, 

rmn||Y - *0|||., s.t. G /CnP. (4) 

D can actually be found using the solution of Q, by selecting the atoms of $ which have been used at 
least once in the representation of Y. 

A. Boundedness and Uniqueness of the Solutions 

The constraint set /C U "P is unbounded. This means that for any given finite value t, there exists at 
least a point X G /CUP such that maxj ||Xj||oo > i- It is necessary to find a condition which guarantees 
the boundedness of the solution of (|4]). Such a condition is given in Lemma [T] To prove this lemma, we 
use the following proposition. 

Proposition 1: Let B'^ be an open ball centred at the origin, with the radius r, defined by B!^ = 
{A G W^^^,maxij |Ajj| < r}. For a given C G M+, if 

Null(*)n/CnP = {0}, (5) 
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there exists a finite radius r G M+ such that, V Xkp G {lCnV)\B^, 

mill \\Xkp -^n\\f > C, 

XjveNull(*) 

Proof: Null($) is a (Unear) subspace of R™^^ and /C n P is a union of subspaces fT8^, which 
intersect at the origin. The shortest distance between a given non-zero point "Kkp in /CnP and Null($) 
is non-zero, as is the only point in Null($) (1 JCnV. This distance becomes larger, if Xj^p moves 
away from the origin with the formula aJ^xp, for a > 1. Therefore, there exists a radius r, which any 
point in /C n "P, located outside of B^, is at least ( away from the closest point in Null($). ■ 
Lemma 1: Let the null space of the operator in the space M™^^, be noted by J\f. The solutions of 
^ are bounded if and only if M nJCnV = {0}. 

Proof: Let X be a solution of Q and X;^^ and X?^ respectively be the projection of X onto the 
null-space and range of As ||X|||, = ||X;v^|||. + ||X7^|||,, we only need to show that ||X;^||i7 and 
11X7^ II p are bounded for any solution of (j4]Q As the matrix € /C nP, any solution of (jiji should then 
have smaller objective than this matrix. We can then have, 

2||Y||f > ||Y||p + ||Y - *X||f 

> II^XIIj. 

> fmmllXT^IIir, 

where cJmm is the minimum (non-zero) singular value of This induces ||X7^||ir < 2cr^^^|| Y||p, which 
is the boundedness of HXT^Hi?. 

We respectively denote A and A as the support index of X, i.e. X.x ^ 0, A G A, and its complement. 
The matrix Aa (respectively A^) is a matrix which is equal to A on the support index (respectively on 
the complement of support index) and zero on the other indices. The solution X is zero on the indices 
specified by A, i.e. X^ = 0. X^ = X^v'a + ^t^a ~ shows that X^v'a = —^Ha- other hand, 

ll^v' ||2 ll"v" l|2 ll"v" l|2 

II^TeAllF — II^TellF ~ II^TeAllF 
<4cT-Ll|Y||l„ 

which assures the boundedness of X^v'a • We finally need to show that X^v'a is also bounded. Momentarily 
assume that Xtv'a is unbounded. Xa = Xt^^ + Xtv'a is in /C n "P and X;^ G M. As Xa is unbounded 

'We here show that Frobenius norm of X is bounded, which induces the boundedness of maxi 1 1 Xi I loo- 
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when X;v^ is unbounded, we can use Proposition [l] with ( = (||X7^^|||^ + ||X_a4|||') ^ as follows, 

< I|Xa - ^atWf 

= II (X7^A + XaTa) - (Xat^^ + XxJ fp 

= W^TIa - ^A^vIIf 

= II^TeAllF + II^A/kllF! 

which contradicts with the fact that = IIX?^^ |||,+||X7v'a III'- Therefore the assumption of unboundedness 
of X^v'a is incorrect, which complete the proof of boundedness of X. 

If ([5]) is not valid, we have a non-zero A € R™^^ in the null space of which is also overcomplete 
joint sparse. This means that any non-zero (A;,p)-joint sparse solution X, with the same support as A, 
can generate another solution of (|4]) by X + AA, A € M. By tending A to infinity, such a solution would 
be unbounded, which shows the necessity of M nJC nV = {0}. ■ 

Although this lemma shows the boundedness of the solutions, it does not provide any explicit bound 
for the results. It means that if the Null($) subspace is very close to one of the subspaces in /C n "P, C 
can become very large. 

The reader may noticed that we did not use the optimality of X, in the proof of Lemma [T] Instead, 
we used the fact that the objective at X is less than the objective at 0. Therefore we can easily extend 
this lemma, to derive the boundedness of the search space. 

Corollary 1: The set {0 G /C n ||Y - ^Q\\f < W^Wf} is bounded if Q is true. 

It is always useful to know when an optimisation problem like ([5]), has a unique solution. This is 
particularly useful in the dictionary design problem, as the other formulations has often multiple solutions. 
This is caused by the fact that any permutation of a dictionary is also a solution for the problem. This 
indeed makes the convexification of the problem much more challenging. 

We present a sufficient condition for the uniqueness of solution in the following lemma, which is a 
particular case of the uniqueness results for the Union of Subspaces (UoS) model |18|. 

Lemma 2: Let k < ^, p < §, = Null{$} and IC2k ■= {X G C™^-^ : ||x/||o < 2k,yi £ [l,L]} 
and V2p ■■= {X € C™^-^ : ||X|| 

0,00 ^ 2p} The optimisation problem (j4jl has a unique solution if 

MnlC2kriV2p = {0} (6) 

Proof: Let the solution not be unique and we have Xi and X2 as two distinctive solutions of Q. We 
have *Xi = *Xi = Y, which means *(Xi - X2) = 0. As Xi - X2 G lC2k n V2p and Xi - X2 G Af, 
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Fig. 1. TZ for different n, when 5 — j, p = and t = 100. / is shown by the solid line and the bounds for TZ are shown 
with error bars, see |To}. 

it should be 0, which gives Xi = X2 and it contradicts with the fact that they are distinctive solutions. 

■ 

Remark 1: Note that Lemma |2] presents a sufficient condition for the uniqueness of the solution, which 
is different to the standard k-sparse and k-joint sparse UoS models. Similar to the general form of block- 
sparse model, this is caused by the fact that some of the sparsity patterns in K,2k^'P2p can not be divided 
to two disjoint sparsity patterns in /C n P. 

Remark 2: The boundedness of the solutions of (|4]) needs a weaker condition than its uniqueness. We 
can actually use the uniqueness condition of Lemma |2] to show the boundedness of the solution. 

B. Number of Subspaces 

It was mentioned that the introduced signal model is a UoS model, as fixing the support coefficient, 
generates a low-dimensional subspace of the W^^^. We are restricting the set of matrices which are 
/c-sparse on each column, to the matrices which are also p-joint sparse. Such a restriction reduces the 
number of admissible subspaces, which we quantify it here. When the matrix is /c-sparse on rows, we 
have L times (^) options to choose the support. The number of subspaces is thus If we also restrict 
the matrices to be p-joint sparse, we choose k positions for each row, within the selected p rows. We 
have therefore (^)^(p) subspaces. To quantify exponential reduction in the number of subspaces using 
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the (/c,p)-joint sparsity model, we approximate TZ, defined as, 

n := log2 T . (7) 

a)' 

To find some upper and lower bounds for TZ, we use the concept of binary entropy H. from Information 
Theory, which is defined as follows, 

nir) ^ -rlog2(r) - (1 - r) log2(l - r) (8) 

where < r < 1 is the probability of a binary number. We can now bound (^) as follows |fT9l eq. 
12.40], 

1 ^^'u(k\ _ (n 



2««(^t)<r <2"^(^). (9) 



n + 1 \k 

Using the similar bounds for (j^) and (^), and after some simple algebraic manipulations, we can derive 
a bound for Ti, as follows, 

-Llog2(p+ l)-log2(ra + 1) + f{k,p,n,L) 

(10) 

<n< Llog2(n + 1) + f{k,p,n,L), 



where f{k,p, n, L) = n% (2) + LpH ^| j — nLH. (^) If we replace the binary entropy in f{k,p, n, L), 
we can derive an expUcit formulation for / as follows, 

f{k,p,n,L)= (plog2^^'^ hplog2"^ — ^ 

V n — p p 

/ ^^^^ 

- L nlog2 -lJlog2 r + fcloga 



n — k p — k p — k 

As ([To]) depends on many parameters, it is hard to figure out the reduction in the number of subspaces 
form TZ. To demonstrate this better, we can fix 6 = ^, /? = - and t = ^, and plot / based on n, 



which is an approximation for TZ, and showing the bounds of (10 1 with some error bars. If we choose 
6 = \, = and t = 100, the bounds for TZ are plotted as functions of n in Figure [Tj As 2^ is the 
ratio between the number of subspaces in the new model and fc-sparsity model, we can see that ratio is 
significantly reduced for large n. In other words, the search space for the solution is now much smaller, 
which may boost the exact recovery using practical recovery algorithms, as we can see in the simulation 
section. 
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III. Optimal Dictionary Selection 

Although the objective of ([4]) is quadratic, the optimisation of (|4]) subject to the non-convex constraints 
K, and V, is not easy. Most of the efficient optimisation techniques can not be used in this setting. A 
powerful technique, called the projected gradient, can be used when the projection onto the admissible 
set is available. In the space of real matrices W^^^ , the projection of a point X G R™^^ onto a closed 
set K C ]R™x-^ is defined by ^^(X) := argmin@gK ||0 — X||^, where || • ||;^ is the norm of the proposed 
space. We use the Hilbert - Schmidt, or Frobenius, norm here, as it is more related to the quadratic 
objective Q, i.e. using the same normed space, and we can analytically find the projection. In this 
setting, a projection onto K, can be found by keeping the k largest coefficients of each column and letting 
the others be zero. The projection onto V can be found by keeping the p rows of X with the largest 
maximum absolute values and letting the other rows be zero. Sadly, the projection onto the intersection 
of /C and V is not analytically possible, the projected gradient algorithm can not be used in its canonical 
form. A property of the admissible sets K, and V is that the consequent projections of a point in these 
sets provide a point in the intersection of them, which may indeed not necessarily be the projection 
onto ICnV. The following lemma shows that alternating projection onto /C and V converges in a single 
alternating projections. 

Lemma 3: Let X be a bounded matrix in R™^^. The following two statements hold, 

(12) 

Proof: Projections of X onto K, or V shrinks some of X's non-zero elements to zero and does not 
produce any further non-zero elements. This simply shows that the projection of a point in /C, onto V, 
gives a new point which is still in JC. It assures the first statement. The second statement can be shown 
similarly. ■ 
Remark 3: The sets JC and V are non-convex and the projection onto each of these sets may thus be 
non-unique. In this case we can randomly choose one of the projections. 

A. Optimal Dictionary Selection Algorithm 

We use a gradient based method which iteratively updates the current solution X'"!, in the negative 
gradient direction and maps onto a point in /CnP, to approximately solve If ip(X.) := ||Y — $X|||,, 
the gradient of ip can be found as follows. 
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G := J^V(X) = 2$^ (*X - Y) . (13) 

An important part of the gradient descent methods, is how to select the step size. An efficient step 
size selection technique for unconstrained quadratic minimisation problems, with objectives like V'(X), 
is to use half of the spectral radius of hnear operator, here as follows, 

^ ~ 2 G^G 

Such a step size is optimal for the first order gradient descent minimisation of the unconstrained 
problem with the quadratic objective ip(X.). In a constrained minimisation scenario, we can choose a 
similar initial step size and shrink the size, if the objective increases. It thus needs an extra step to check 
that the objective is actually not increased after each update of the parameters. A more clever initial 
step size was selected in ifTTl for the sparse approximations of fc-sparse signals. If the support of sparse 
coefficient vectors are fixed, i.e. the overall projection steps do not change the support, the update is 
only in the direction which changes current non-zero coefficients. When the problem size is shrunk to 
the space of current support, the problem is quadratic and the step size can be similarly calculated using 
the gradient matrix G, constrained to the support, as follows, 

1 Gf *^*G5 

^ " 2 G^Gs 

where G^ G W^^^ is the gradient matrix G masked by the support of X, S, as follows. 



{Gs}i,j 



{GK, {XK,/0 
Otherwise 



A pseudo-code for the algorithm is presented in Algorithm [T] The condition which is checked in line 
[TOl guarantees that the algorithm reduces the objective by updating the coefficients. As the dictionary 
selection algorithm [T] is based upon a gradient projection type technique, the learned dictionary may be 
more suitable for such greedy sparse approximation techniques. However, the simulation results show 
that the reference dictionary can be recovered using this algorithm, given a rich set of training samples. 
If the real signal is sparse and the dictionary satisfies the exact recovery conditions, the dictionary is thus 
optimal for any sparse recovery algorithms. 

In the following theorem, we prove that Algorithm [T] is numerically stable and the generated sequence 
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1: initialisation: X^, S = supp {V/c [Vv (*^Y))), p < 1, /? < 1, e < 1, t = 0, iiT > 1 and 
2: while i<Kk.t^ldo 
3: G = 2*^ (*X[^1 - Y) 

P - 2 G^Gs 
5: Z = P/c(7'p (xH -/xG)) 

6: if ||XH - Z||2 < e then 

7: t = l 

end if 
9: if t / 1 then 

2||*(XW-Z)|P^ 



10: while /i > f JSw^fe do 



11: ^J' = |3■^J- 

12: Z = Px; (T^p (XW -/xG)) 

13: end while 

14: end if 

15: Z = i + 1 

16: XW = Z 

17: 5 = supp(xH) 

18: end while 

19: X* = X['-y 

20: output: X* 

Algorithm 1: Alternating Projected Gradient for Optimal Dictionary Selection 



has limit points. 

Theorem 1: Let X^*^! G W^^^ be a bounded initial point. The gradient based method of Algorithm [l] 
generates a bounded sequence of solutions, which accumulate. 

Proof: As the algorithm reduces the objective at each iteration, the search space is a bounded subset 
of /C n "P, based upon Corollary [T] /C n "P is a closed set, the search space is then a compact subset of 
j^mxL rpj^g sequence generated by Algorithm [l] lives in a compact set, which is enough to have bounded 
accumulation points, based on the Bolzano-Weierstrass theorem. ■ 

IV. Simulations 

In the first experiment, a dictionary $ G ]^20x80 randomly generated using a normal zero mean 
distribution with unit variance and normalised to have unit ^2-norm on each column. The target dictionary 
D € ]R20x30 generated by randomly selecting j» = 30 atoms of A number L = 320 of A;-sparse 
coefficient vectors (with k = 4), were generated by randomly selecting the support, with a uniform 
distribution of the magnitudes in [0.2, 1] and random signs. A set of training matrix Y of length 
L were generated using the generative model and randomly generated sparse vectors. To recover the 



13 



.E 40 
E 



(a) 



«iVi,r'iiiivfiV«iini\i|n,i(iii(ivmT 



(b) 



201 

30 I 



illllTlffl'limVU'Mliiilllll'™ 

20 iiyrriiii'y:wiii'i'nw't"iii'ii'it 



:''::!5WillMi;(i 

iiy/i'iir'niivii'miiiBiiUuiini 



200 400 600 200 400 600 200 400 600 

Coefficient vectors indices Coefficient vectors indices Coefficient vectors indices 



Fig. 2. Optimal Dictionary Selection using, (a) K, (b) V and (c) JCCiV as admissible sets. The black dots in each plot indicate 
non-zero coefficients. In plot (b), as dots are very populated, we observe solid horizontal lines. Gray horizontal lines are plotted 
as a guideline, for the correct dictionary. 




Fig. 3. Phase transition using, (a) IC, (b) V and (c) ICnV as admissible sets. The black area indicates successful recovery of 
the dictionary. 



optimal generative dictionary D, given Y, p and k, we used a gradient descent based algorithm similar 
to Algorithm [T] with three different admissible sets, and demonstrate the superiority of the proposed 
technique. We first used IC from ([2]) and no constraint on the row-sparsity of the coefficient matrix X and 
showed the recovered support of the sparse matrix in the left panel (a). If we only assume joint sparsity 
model and use V from ^ as the admissible set, we find the coefficient matrix whose support is shown 
in the middle panel (b). Using both constraint sets, as explained in Algorithm [T| provides a coefficient 
matrix whose support is shown in the right panel (c). The correct J' is shown in these plots using some 
grey lines. It is clear that the proposed projected gradient onto both sets can correctly recover J', where 
the other two methods have some errors in the recovery. 

This experiment can be repeated for different S = ^ and /O = ^ by selection a range of p and k's, 
while keeping m and n fixed. If we repeat the simulations 100 times for each setting and calculate the 
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average exact dictionary recovery, we can plot the phase transition for each methods. We have plotted 
such phase transitions in Figure |3j with k sparsity constraint in (a), p joint sparsity constraint in (b) and 
proposed constraint in (C). The black colour means high exact dictionary recovery. The area with exact 
recovery in (c) is larger than the same areas in (a) and (b) added together. This clearly demonstrates the 
relevance of the new framework. 

In the next set of experiments, we will select a subset of the Curvelet |[20ll dictionary for the sparse 
representation of fingerprints. We chose a Curvelet transform for the image size 64 by 64. The mother 
dictionary $ G ]^4096x 10521 roughly 2.59 times overcomplete, which we want to shrink to half size, 
i.e. D G ]^4096x5260 ^pj^jg indeed a large scale dictionary learning problem, which is difficult to solve 
in a standard dictionary learning setting. With the help of the proposed method, we can handle such a 
big dictionary selection process, as we need fewer training samples, only need to keep a sparse matrix, 
i.e. sparse representation matrix, in the memory and use the fact that the mother dictionary has a fast 
implementation. We assume the sparsity of each image patch is A; = 1052 ^ O.IN and L = 64. We used 
two different settings here to choose the dictionary, a) p-joint sparsity model and b) (A;,p)-overcomplete 
joint sparse model. The simulations were done in the Matlab environment, on a 12-core, 2.6 GHz linux 
machine, which respectively took 72 and 90 seconds to learn Dp and D(;j p). Another fingerprint image 
was used to test the selected dictionaries. The original image and the k sparse representation of the 
original image with $ are shown in the first row of Figure |4] The fc-sparse representation with the 
learned D's are shown in the second row of this figure. The left image is the representation with learned 
Dp, when the model was p-joint sparse and the right image is the same, but with D(fcp), where the 
(A;,p)-overcomplete joint sparse model was incorporated. As we can see the PSNR of the representation 
with the shrunk dictionary ^(k,p) is slightly better than the other. We can also see the bottom-right quarter 
of these images in Figure [5] in the same order. The sparse reconstructed images are actually denoised 
and the reconstructed image using ^(k,p) is more similar than Dp to the image reconstructed using 

V. Summary and Future work 

We presented a new technique for dictionary selection for the linear sparse representation, when a 
collection of possibly suitable atoms and some exemplar signals are available. The dictionary selection 
problem is reformulated as a more general form of the joint sparse approximation problem, when 
the number of active locations in sparse coefficients is larger than the size of signal space. As such 
overcomplete joint sparsity framework has generally infinitely many solutions, the sparsity within the 
active set helps to regularise the problem. It was shown that the overcomplete joint sparse approximation 
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Original Dictionary Sparse Image Approximation 
PSNR = 41,366 
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Fig. 4. The original image (top left), the k sparse representation of the original image with the dictionaries, $ (top right), D. 
(bottom left) and D(fc_p) (bottom right). 




Fig. 5. The bottom-right quarter of the images shown in Figure |4] in the same order. 

problem is well-defined under some conditions on the null-space of the matrix generated by the given large 
set of atoms (mother dictionary). As the objective of the introduced program is continuously differentiable, 
we used a gradient mapping technique to approximately solve the problem. The introduced algorithm 
converges in a weak sense (convergence to a bounded non-empty set). 
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We presented some synthetic data simulation result to support this hypothesis that the introduced 
algorithm can recover the original dictionary. The phase plot of the dictionary recovery is compared with 
two other cases, when we use other sparsity models, namely fe-sparse and p-joint sparse model. As the 
simulations with synthetic data were promising, we also did some simulations to select a subset of a 
commonly used dictionary, Curvelet dictionary, to reduce the complexity of the sparse coding algorithm. 
The size of dictionary learning problem is such that it cannot be handled by the vast majority of current 
dictionary learning algorithms. As we do not need to keep the dictionary in the memory and as the 
dictionary-vector multiplications can be implemented very efficiently, the learning in the new framework 
is relatively easy. The results show that we can roughly get the same image quality for a specific class 
of images, when we use a smaller dictionary than the mother dictionary. 

The new overcomplete joint sparsity model seems an interesting extension of the previously investigated 
joint sparsity model. We have left the theoretical investigation of exact recovery and other sparse signal 
processing applications, for the future work. 
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